UNCLASSIFIED 


Defense  Technical  Information  Center 
Compilation  Part  Notice 

ADP013757 

TITLE:  Approximation  by  Perceptron  Networks 
DISTRIBUTION:  Approved  for  public  release,  distribution  unlimited 


This  paper  is  part  of  the  following  report: 

TITLE:  Algorithms  For  Approximation  IV.  Proceedings  of  the  2001 
International  Symposium 

To  order  the  complete  compilation  report,  use:  ADA412833 

The  component  part  is  provided  here  to  allow  users  access  to  individually  authored  sections 
of  proceedings,  annals,  symposia,  etc.  However,  the  component  should  be  considered  within 
the  context  of  the  overall  compilation  report  and  not  as  a stand-alone  technical  report. 

The  following  component  part  numbers  comprise  the  compilation  report: 

ADP013708 thru ADP013761




Approximation by perceptron networks

Vera  Kurkova 

Institute  of  Computer  Science,  Academy  of  Sciences  of  the  Czech  Republic 
Pod vodarenskou vezi 2, P.O. Box 5, 182 07 Prague 8, Czechia
vera@cs.cas.cz


1 Introduction 

The classical perceptron proposed by Rosenblatt [22] as a simplified model of a neuron computes a
weighted sum of its inputs and, after comparing it with a threshold, applies an activation function
representing the rate of neuron firing. To model this rate, Rosenblatt used the discontinuous
Heaviside threshold function, which, together with its various continuous approximations, is still
the most widespread type of activation used in neurocomputing. Formally, a perceptron with the
Heaviside activation function computes the characteristic function of a half-space of $\mathcal{R}^d$,
which for practical reasons (all inputs are bounded) is restricted to a box, usually $[0,1]^d$. Thus
the theoretical study of perceptron networks leads to various questions concerning approximation of
functions by a special class of plane waves formed by linear combinations of characteristic functions
of half-spaces (corresponding to the simplest model of a perceptron network, called the
one-hidden-layer network with a linear output unit).

Although Rosenblatt’s model was inspired biologically, plane waves (sometimes called ridge
functions) have been studied for a long time by mathematicians motivated by various problems from
physics. In contrast to integration theory, where functions are approximated by linear combinations
of characteristic functions of boxes (simple functions), the theory of perceptron networks studies
approximation of multivariable functions by linear combinations of characteristic functions of
half-spaces. Expressions in terms of such functions exhibit the strength and weakness of plane wave
methods described by Courant and Hilbert [4], page 676: “But always the use of plane waves fails to
exhibit clearly the domains of dependence and the role of characteristics. This shortcoming, however,
is compensated by the elegance of explicit results.”

In  this  paper  we  survey  our  recent  results  on  properties  of  approximation  by  linear  combinations 
of  characteristic  functions  of  half-spaces.  We  focus  on  existence  of  best  approximation,  impossibility 
of  choosing  among  best  approximations  a continuous  one,  estimates  of  rates  of  approximation  by 
linear  combinations  of  n characteristic  functions  of  half-spaces  and  integral  representation  as  a linear 
combination  of  a continuum  of  half-spaces. 


This  work  was  partially  supported  by  GA  CR  201/99/0092  and  201/02/0428. 

2 Preliminaries

A perceptron with an activation function $\psi : \mathcal{R} \to \mathcal{R}$ (where $\mathcal{R}$
denotes the set of real numbers) computes real-valued functions on
$\mathcal{R}^d \times \mathcal{R}^{d+1}$ of the form $\psi(v \cdot x + b)$, where
$x \in \mathcal{R}^d$ is an input vector, $v \in \mathcal{R}^d$ is an input weight vector and
$b \in \mathcal{R}$ is a bias.

The most common activation functions are sigmoidals, i.e., functions with an ess-shaped graph.
Both continuous and discontinuous sigmoidals are used. Here, we study networks based on the
discontinuous Heaviside function $\vartheta$ defined by $\vartheta(t) = 0$ for $t < 0$ and
$\vartheta(t) = 1$ for $t \geq 0$. Let $H_d$ denote the set of functions on $[0,1]^d$ computable by
Heaviside perceptrons, i.e.,

$$H_d = \{ f : [0,1]^d \to \mathcal{R} \mid f(x) = \vartheta(v \cdot x + b),\ v \in \mathcal{R}^d,\ b \in \mathcal{R} \}.$$

Notice that $H_d$ is the set of characteristic functions of half-spaces of $\mathcal{R}^d$ restricted
to $[0,1]^d$.
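
As a minimal illustration of this definition, the following Python sketch evaluates one element of $H_d$ at two points of the unit square. The names `heaviside` and `perceptron` are illustrative choices, not notation from the paper, and the sketch assumes the convention that the Heaviside function equals 1 at 0.

```python
def heaviside(t):
    """Heaviside function: 0 for t < 0, 1 for t >= 0 (assumed convention)."""
    return 1.0 if t >= 0 else 0.0


def perceptron(v, b, x):
    """Value at x of the characteristic function of the half-space v.x + b >= 0."""
    return heaviside(sum(vi * xi for vi, xi in zip(v, x)) + b)


# Half-space x1 + x2 >= 1, restricted to the unit square [0, 1]^2.
v, b = (1.0, 1.0), -1.0
print(perceptron(v, b, (0.2, 0.3)))  # a point with x1 + x2 < 1
print(perceptron(v, b, (0.8, 0.9)))  # a point with x1 + x2 > 1
```

The function takes only the values 0 and 1, and the set where it equals 1 is the intersection of a closed half-space with the box.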

For all positive integers $d$, $H_d$ is compact in $(\mathcal{L}_p([0,1]^d), \|\cdot\|_p)$ with
$p \in [1,\infty)$ (see, e.g., [8]). This can be verified easily once the set $H_d$ is
reparameterized by elements of the unit sphere $S^d$ in $\mathcal{R}^{d+1}$. Indeed, a function
$\vartheta(v \cdot x + b)$, with a non-zero vector $(v_1, \ldots, v_d, b) \in \mathcal{R}^{d+1}$, is
equal to $\vartheta(\hat{v} \cdot x + \hat{b})$, where
$(\hat{v}_1, \ldots, \hat{v}_d, \hat{b}) \in S^d$ is obtained from
$(v_1, \ldots, v_d, b) \in \mathcal{R}^{d+1}$ by normalization.
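
The invariance under normalization can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `normalize` and `agree_on_grid` are hypothetical, the Heaviside function is taken to equal 1 at 0, and the parameters are chosen so that no grid point lies on the boundary hyperplane (where rounding could matter).

```python
import math


def heaviside(t):
    """Heaviside function with the assumed convention heaviside(0) = 1."""
    return 1.0 if t >= 0 else 0.0


def perceptron(v, b, x):
    """Characteristic function of the half-space v.x + b >= 0, evaluated at x."""
    return heaviside(sum(vi * xi for vi, xi in zip(v, x)) + b)


def normalize(v, b):
    """Scale the non-zero vector (v_1, ..., v_d, b) to unit Euclidean length."""
    norm = math.sqrt(sum(vi * vi for vi in v) + b * b)
    if norm == 0.0:
        raise ValueError("the parameter vector must be non-zero")
    return tuple(vi / norm for vi in v), b / norm


def agree_on_grid(v, b, steps=10):
    """Compare the original and normalized perceptrons on a grid in [0, 1]^2."""
    v_hat, b_hat = normalize(v, b)
    grid = [i / steps for i in range(steps + 1)]
    return all(
        perceptron(v, b, (x1, x2)) == perceptron(v_hat, b_hat, (x1, x2))
        for x1 in grid
        for x2 in grid
    )


print(agree_on_grid((3.0, -4.0), 0.25))
```

Dividing by a positive number does not change the sign of $v \cdot x + b$, which is why both parameterizations define the same half-space.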

The simplest type of multilayer feedforward network has one hidden layer and one linear output.
Such networks with Heaviside perceptrons in the hidden layer compute functions of the form

$$\sum_{i=1}^{n} w_i \vartheta(v_i \cdot x + b_i),$$

where $n$ is the number of hidden units, $w_i \in \mathcal{R}$ are output weights and
$v_i \in \mathcal{R}^d$ and $b_i \in \mathcal{R}$ are input weights and biases, respectively. The set
of all such functions is the set of all linear combinations of $n$ elements of $H_d$ and is denoted
by $\mathrm{span}_n H_d$.
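
A one-hidden-layer Heaviside network of this form can be sketched in a few lines of Python. The function name `network` and the example weights are illustrative assumptions; the example uses $n = 2$ hidden units in $d = 2$ variables to represent the indicator of a strip.

```python
def heaviside(t):
    """Heaviside function with the assumed convention heaviside(0) = 1."""
    return 1.0 if t >= 0 else 0.0


def network(weights, vs, bs, x):
    """Evaluate sum_i w_i * heaviside(v_i . x + b_i) at the point x."""
    total = 0.0
    for w, v, b in zip(weights, vs, bs):
        total += w * heaviside(sum(vi * xi for vi, xi in zip(v, x)) + b)
    return total


# n = 2 hidden units: indicator of the strip 0.25 <= x1 < 0.75 in [0, 1]^2,
# written as heaviside(x1 - 0.25) - heaviside(x1 - 0.75).
weights = [1.0, -1.0]
vs = [(1.0, 0.0), (1.0, 0.0)]
bs = [-0.25, -0.75]

for x in [(0.1, 0.5), (0.5, 0.5), (0.9, 0.5)]:
    print(x, network(weights, vs, bs, x))
```

Only the middle point lies in the strip, so only there do the two hidden units fail to cancel.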

For all positive integers $d$, $\bigcup_{n \in \mathcal{N}_+} \mathrm{span}_n H_d$ (where
$\mathcal{N}_+$ denotes the set of all positive integers) is dense in
$(\mathcal{C}([0,1]^d), \|\cdot\|_{\mathcal{C}})$, the linear space of all continuous functions on
$[0,1]^d$ with the supremum norm, as well as in $(\mathcal{L}_p([0,1]^d), \|\cdot\|_p)$ with
$p \in [1,\infty]$ (see, e.g., [5, 9]).

3 Existence  of  a best  approximation 

A subset $M$ of a normed linear space $(X, \|\cdot\|)$ is called proximinal if for every $f \in X$
the distance $\|f - M\| = \inf_{g \in M} \|f - g\|$ is achieved for some element of $M$, i.e.,
$\|f - M\| = \min_{g \in M} \|f - g\|$ (see, e.g., [23]). Clearly, a proximinal subset must be closed.

A sufficient condition for proximinality of a subset $M$ of a normed linear space $(X, \|\cdot\|)$ is
compactness or bounded compactness. However, by extending $H_d$ into $\mathrm{span}_n H_d$ for any
positive integer $n$ we lose compactness. Nevertheless, compactness can be replaced by a weaker
property that requires only those sequences that “minimize” a distance from $M$ of an element of $X$
to have convergent subsequences. More precisely, a subset $M$ of a normed linear space
$(X, \|\cdot\|)$ is called approximatively compact if for each $f \in X$ and any sequence
$\{g_i : i \in \mathcal{N}_+\} \subseteq M$ such that $\lim_{i \to \infty} \|f - g_i\| = \|f - M\|$,
there exists $g \in M$ such that $\{g_i : i \in \mathcal{N}_+\}$ converges subsequentially to $g$
(see, e.g., [23], p. 368). The following theorem is from [16].

Theorem 3.1 For all positive integers $n, d$, $\mathrm{span}_n H_d$ is an approximatively compact
subset of $(\mathcal{L}_p([0,1]^d), \|\cdot\|_p)$ with $p \in [1,\infty)$.

The proof is based on an argument showing that any sequence of elements of $\mathrm{span}_n H_d$ has
a subsequence that converges either to an element of $\mathrm{span}_n H_d$ or to a Dirac delta
distribution, and the latter case cannot occur when such a sequence “minimizes” a distance from some
function in $\mathcal{L}_p([0,1]^d)$.

It  follows  directly  from  the  definitions  that  each  approximatively  compact  subset  is  proximinal. 

Corollary 3.2 For all positive integers $n, d$, $\mathrm{span}_n H_d$ is a proximinal subset of
$(\mathcal{L}_p([0,1]^d), \|\cdot\|_p)$ with $p \in [1,\infty)$.

Thus, for any fixed number $n$, a function in $\mathcal{L}_p([0,1]^d)$ has a best approximation among
functions computable by a linear combination of $n$ characteristic functions of half-spaces.

4 Uniqueness  and  continuity  of  a best  approximation 

Let $M$ be a subset of a normed linear space $(X, \|\cdot\|)$ and let $\mathcal{P}(M)$ denote the set
of all subsets of $M$. The set-valued mapping $P_M : X \to \mathcal{P}(M)$ defined by
$P_M(f) = \{ g \in M : \|f - g\| = \|f - M\| \}$ is called the metric projection of $X$ onto $M$ and
$P_M(f)$ is called the projection of $f$ onto $M$.

Let $F : X \to \mathcal{P}(M)$ be a set-valued mapping. A selection from $F$ is a mapping
$\phi : X \to M$ such that for all $f \in X$, $\phi(f) \in F(f)$. A mapping $\phi : X \to M$ is called
a best approximation operator from $X$ to $M$ if it is a selection from $P_M$.

When $M$ is proximinal, $P_M(f)$ is non-empty for all $f \in X$ and so there exists a best
approximation mapping from $X$ to $M$. The best approximation need not be unique. When it is
unique, $M$ is called a Chebyshev set (or “unicity” set). Thus $M$ is Chebyshev if for all $f \in X$
the projection $P_M(f)$ is a singleton.

Recall that a normed linear space $(X, \|\cdot\|)$ is called strictly convex (also called “rotund”)
if for all $f \neq g$ in $X$ with $\|f\| = \|g\| = 1$ we have $\|(f + g)/2\| < 1$. It is well known
that for all $p \in (1,\infty)$, $(\mathcal{L}_p([0,1]^d), \|\cdot\|_p)$ is strictly convex.

The following theorem from [13] implies for $p$ in the open interval $(1,\infty)$ that if among best
approximation mappings to $\mathrm{span}_n H_d$ (the existence of which is guaranteed by
Corollary 3.2) there is a continuous one, then $\mathrm{span}_n H_d$ must be a Chebyshev set.

Theorem  4.1  In  a strictly  convex  normed  linear  space,  any  subset  with  a continuous  selection 
from  its  metric  projection  is  Chebyshev. 

We shall combine this theorem with the following geometric characterization of Chebyshev sets
with a continuous best approximation from [24].

Theorem 4.2 In a Banach space with strictly convex dual, every Chebyshev subset with continuous
metric projection is convex.

It is well known that $\mathcal{L}_p$-spaces with $p \in (1,\infty)$ satisfy the assumptions of this
theorem (since the dual of $\mathcal{L}_p$ is $\mathcal{L}_q$, where $1/p + 1/q = 1$ and
$q \in (1,\infty)$) (see, e.g., [7], p. 160). Hence, to show the non-existence of a continuous
selection, it is sufficient to verify that $\mathrm{span}_n H_d$ is not convex.

Proposition 4.3 For all positive integers $n, d$, $\mathrm{span}_n H_d$ is not convex.

Indeed, consider $2n$ parallel half-spaces with the characteristic functions
$g_i(x) = \vartheta(v \cdot x + b_i)$, where $0 > b_1 > \ldots > b_{2n} > -1$ and
$v = (1, 0, \ldots, 0) \in \mathcal{R}^d$. Then $\frac{1}{2} \sum_{i=1}^{2n} g_i$ is a convex
combination of two elements of $\mathrm{span}_n H_d$, namely $\sum_{i=1}^{n} g_i$ and
$\sum_{i=n+1}^{2n} g_i$, but it is not in $\mathrm{span}_n H_d$, since its restriction to the
one-dimensional set $\{ (t, 0, \ldots, 0) \in \mathcal{R}^d : t \in [0,1] \}$ has $2n$ discontinuities.
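
The counting argument can be illustrated numerically for $n = 2$. The sketch below is a hedged illustration, not part of the original proof: the biases, the grid resolution, and the helper names `midpoint_on_line` and `count_jumps` are assumptions made for the example. It uses the fact that each Heaviside perceptron restricted to a line is a step function with at most one jump, so an element of $\mathrm{span}_n H_d$ has at most $n$ jumps along the line.

```python
def heaviside(t):
    """Heaviside function with the assumed convention heaviside(0) = 1."""
    return 1.0 if t >= 0 else 0.0


def midpoint_on_line(t, biases):
    """Value at (t, 0, ..., 0) of (1/2) * sum_i heaviside(t + b_i)."""
    return 0.5 * sum(heaviside(t + b) for b in biases)


def count_jumps(func, steps=1000):
    """Count value changes of func between consecutive grid points of [0, 1]."""
    values = [func(k / steps) for k in range(steps + 1)]
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


n = 2
biases = [-0.2, -0.4, -0.6, -0.8]  # 0 > b_1 > ... > b_{2n} > -1
jumps = count_jumps(lambda t: midpoint_on_line(t, biases))
print(jumps, "jumps; an element of span_n H_d has at most", n, "on this line")
```

The midpoint of the two sums jumps once at each of the $2n$ thresholds $-b_i$, which exceeds the $n$ jumps available to any single element of $\mathrm{span}_n H_d$.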

Summarizing  results  of  this  section  and  the  previous  one,  we  get  the  following  corollary. 
Corollary 4.4 In $(\mathcal{L}_p([0,1]^d), \|\cdot\|_p)$ with $p \in (1,\infty)$, for all positive
integers $n, d$ there exists a best approximation mapping from $\mathcal{L}_p([0,1]^d)$ to
$\mathrm{span}_n H_d$, but no such mapping is continuous.

Thus convenient properties of projection operators such as uniqueness and continuity are not
satisfied by $\mathrm{span}_n H_d$. These properties would allow one to estimate worst-case errors
using methods of algebraic topology (see, e.g., [6]). In linear approximation theory, application of
such methods shows that some sets of functions defined by smoothness conditions exhibit the curse of
dimensionality: the approximants converge at rate $O(1/\sqrt[d]{n})$, where $d$ is the number of
variables and $n$ is the dimension of the approximating linear space (see, e.g., [20]). Our results
show that these arguments are not applicable to approximation by $\mathrm{span}_n H_d$.

5 Rates  of  approximation 

Let $(X, \|\cdot\|)$ be a normed linear space and $G$ be its subset. Then $G$-variation (variation
with respect to $G$) is defined as the Minkowski functional of the set
$\mathrm{cl\,conv}(G \cup -G)$, i.e.,

$$\|f\|_G = \inf \{ c \in \mathcal{R}_+ : f/c \in \mathrm{cl\,conv}(G \cup -G) \}.$$

Variation with respect to $G$ is a norm on the subspace $\{ f \in X : \|f\|_G < \infty \} \subseteq X$.
The closure in its definition depends on the topology induced on $X$ by the norm $\|\cdot\|$. When $X$
is finite-dimensional, $G$-variation does not depend on the choice of a norm on $X$, since all norms
on a finite-dimensional space are topologically equivalent.

Variation with respect to $G$ was introduced in [17] as an extension of the concept from [1] of
$H_d$-variation, called variation with respect to half-spaces. For functions of one variable,
variation with respect to half-spaces coincides, up to a constant, with the notion of total variation
studied in integration theory (see [1]). For $G$ countable orthonormal, it coincides with the
$l_1$-norm with respect to $G$ (see [18]).

The following theorem from [17] is a reformulation of the Maurey-Jones-Barron Theorem (see [2],
[10], [21]) on estimates of rates of approximation of the order of $O(1/\sqrt{n})$.

Theorem 5.1 Let $(X, \|\cdot\|)$ be a Hilbert space, $G$ be its subset and
$s_G = \sup_{g \in G} \|g\|$. Then for every $f \in X$ and for every positive integer $n$,

$$\|f - \mathrm{span}_n G\| \leq \sqrt{\frac{(s_G \|f\|_G)^2 - \|f\|^2}{n}}.$$

Corollary 5.2 For all positive integers $d, n$ and for every
$f \in (\mathcal{L}_2([0,1]^d), \|\cdot\|_2)$,

$$\|f - \mathrm{span}_n H_d\|_2 \leq \sqrt{\frac{\|f\|_{H_d}^2 - \|f\|_2^2}{n}}.$$

Thus the worst-case error in approximation of functions from the unit ball in $H_d$-variation by
linear combinations of characteristic functions of $n$ half-spaces of $[0,1]^d$ is at most
$1/\sqrt{n}$. Estimates derived from Theorem 5.1 are sometimes called “dimension-independent”, which
is misleading since with an increasing number of variables, the condition of being in the unit ball
in $G$-variation becomes more and more constraining. See [19] for examples of smooth functions with
$H_d$-variation growing exponentially with the number of variables $d$. However, such exponentially
growing lower bounds on variation with respect to half-spaces are merely lower bounds on upper bounds
on rates of approximation by $\mathrm{span}_n H_d$; they do not prove that such functions cannot be
approximated with faster rates than $\|f\|_{H_d}/\sqrt{n}$. Finding whether these exponentially large
upper bounds are tight seems to be a difficult task related to some open problems in the theory of
complexity of Boolean circuits.
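
To make the dependence on $n$ concrete, the following Python sketch evaluates the upper bound of Theorem 5.1 and the number of hidden units that suffices for a prescribed accuracy. The function names `mjb_bound` and `units_needed` and the numerical values of the variation and the norm are hypothetical; computing the variation of a given function is not addressed here.

```python
import math


def mjb_bound(variation, norm, n, s_g=1.0):
    """Upper bound sqrt(((s_G * ||f||_G)^2 - ||f||^2) / n) from Theorem 5.1."""
    gap = (s_g * variation) ** 2 - norm ** 2
    return math.sqrt(max(gap, 0.0) / n)


def units_needed(variation, norm, eps, s_g=1.0):
    """Smallest n for which the bound above is at most eps."""
    gap = (s_g * variation) ** 2 - norm ** 2
    return max(1, math.ceil(gap / eps ** 2))


# Hypothetical function with variation 2 and norm 1.
for n in (1, 3, 12, 100):
    print(n, mjb_bound(2.0, 1.0, n))
print("units sufficient for error 0.5:", units_needed(2.0, 1.0, 0.5))
```

Halving the target error multiplies the sufficient number of units by four, while the number of variables enters only through the variation of the approximated function.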

Some insight into the behavior of $H_d$-variation is provided by its geometric characterization,
derived in [19] using the Hahn-Banach Theorem.

Theorem 5.3 Let $(X, \|\cdot\|)$ be a Hilbert space and $G$ be its nonempty subset. Then for every
$f \in X$,

$$\|f\|_G = \sup_{h \in S} \frac{|f \cdot h|}{\sup_{g \in G} |g \cdot h|}, \quad \text{where } S = \{ h \in X - G^{\perp} : \|h\| = 1 \}.$$

Thus functions that are “almost orthogonal” to $H_d$ (i.e., have small inner products with
characteristic functions of half-spaces) have large $H_d$-variation.
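
A finite-dimensional toy case may help to read this characterization. In the sketch below, $X$ is three-dimensional Euclidean space and $G$ is the standard orthonormal basis, for which $G$-variation is the $l_1$-norm; each admissible $h$ gives a lower bound on the variation. The setting, the vector `f`, and the helper name `variation_lower_bound` are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Euclidean inner product of two vectors given as tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def variation_lower_bound(f, G, h):
    """|f.h| / max_g |g.h|; the ratio does not change when h is rescaled."""
    denom = max(abs(dot(g, h)) for g in G)
    if denom == 0.0:
        raise ValueError("h must not be orthogonal to every element of G")
    return abs(dot(f, h)) / denom


G = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
f = (1.0, -2.0, 3.0)
l1_norm = sum(abs(c) for c in f)  # G-variation of f for this orthonormal G

h_aligned = f  # h in the direction of f
h_signs = tuple(math.copysign(1.0, c) for c in f)  # h with entries sign(f_i)

print(variation_lower_bound(f, G, h_aligned))
print(variation_lower_bound(f, G, h_signs), l1_norm)
```

The direction with entries $\mathrm{sign}(f_i)$ attains the supremum here because it has equally small inner products with all elements of $G$ while remaining well aligned with $f$.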


6 Integral  representation 

The following theorem from [14] shows that a smooth real-valued function on $\mathcal{R}^d$ with
compact support can be represented as an integral combination of characteristic functions of
half-spaces. By $H^-_{e,b}$ is denoted the half-space $\{ x \in \mathcal{R}^d : e \cdot x + b < 0 \}$.

Theorem 6.1 Let $d$ be a positive integer and let $f : \mathcal{R}^d \to \mathcal{R}$ be compactly
supported and $(d+2)$-times continuously differentiable. Then

$$f(x) = \int_{S^{d-1} \times \mathcal{R}} w_f(e,b)\, \vartheta(e \cdot x + b)\, de\, db,$$

where for $d$ odd

$$w_f(e,b) = a_d \int_{H^-_{e,b}} \Delta^{k_d} f(y)\, dy,$$

$k_d = (d+1)/2$, and $a_d$ is a constant independent of $f$, while for $d$ even,

$$w_f(e,b) = a_d \int_{H^-_{e,b}} \Delta^{k_d} f(y)\, \alpha(e \cdot y + b)\, dy,$$

where $\alpha(t) = -t \log |t| + t$ for $t \neq 0$ and $\alpha(0) = 0$, $k_d = (d+2)/2$, and $a_d$ is
a constant independent of $f$.

The assumption that $f$ is compactly supported can be replaced by the weaker assumption that $f$
vanishes sufficiently rapidly at infinity. The integral representation also applies to certain
nonsmooth functions that generate tempered distributions.
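
In one variable, the idea of an integral combination of Heaviside functions reduces to the fundamental theorem of calculus: for a rapidly decaying $f$, $f(x) = \int f'(-b)\, \vartheta(x + b)\, db$. The following Python sketch checks this identity numerically for a Gaussian. It is a one-dimensional analogue chosen for illustration, using only the direction $e = 1$ and the weight $f'(-b)$; it does not reproduce the constants or the weight formula of Theorem 6.1, and the truncation limit and step count are arbitrary assumptions.

```python
import math


def heaviside(t):
    """Heaviside function with the assumed convention heaviside(0) = 1."""
    return 1.0 if t >= 0 else 0.0


def f(x):
    """A rapidly decaying smooth test function."""
    return math.exp(-x * x)


def f_prime(x):
    """Derivative of the test function f."""
    return -2.0 * x * math.exp(-x * x)


def integral_representation(x, limit=8.0, steps=16000):
    """Midpoint-rule value of the integral of f'(-b) * heaviside(x + b) over b."""
    h = 2.0 * limit / steps
    total = 0.0
    for k in range(steps):
        b = -limit + (k + 0.5) * h
        total += f_prime(-b) * heaviside(x + b) * h
    return total


for x in (-1.0, 0.3, 2.0):
    print(x, f(x), integral_representation(x))
```

The two printed values agree up to the discretization error of the midpoint rule, which is dominated by the single grid cell containing the jump of the Heaviside function.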

By an approach reminiscent of the Radon transform but based directly on distributional techniques
from Courant and Hilbert [4], it was shown in [11] that if $f$ is a compactly supported function on
$\mathcal{R}^d$ with continuous $d$-th order partial derivatives, where $d$ is odd, then $f$ can be
represented as

$$f(x) = \int_{S^{d-1} \times \mathcal{R}} v_f(e,b)\, \vartheta(e \cdot x + b)\, de\, db,$$

where $v_f(e,b) = a_d \int_{H_{e,b}} (D_e^{(d)} f)(y)\, dy$,
$a_d = (-1)^{k-1} (1/2) (2\pi)^{1-d}$ for $d = 2k+1$, $D_e^{(d)} f$ is the directional derivative of
$f$ in the direction $e$ iterated $d$ times, $de$ is the $(d-1)$-dimensional volume element on
$S^{d-1}$, and $dy$ is likewise on a hyperplane. Although the coefficients $v_f$ are obtained by
integration over hyperplanes, while the $w_f$ arise from integration over half-spaces, these
coefficients can be shown to coincide by an application of the Divergence Theorem ([3], p. 423) to
the half-spaces $H^-_{e,b}$. Theorem 6.1 extends the representation of [11] to even values of $d$ and
to target functions $f$ which are not compactly supported but which decrease sufficiently rapidly at
infinity.

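For $w \in \mathcal{C}(S^{d-1} \times \mathcal{R})$ and $f \in \mathcal{D}(\mathcal{R}^d)$ define

$$T_H(w)(x) = \int_{S^{d-1} \times \mathcal{R}} w(e,b)\, \vartheta(e \cdot x + b)\, de\, db,$$

$$S_H(f)(e,b) = w_f(e,b).$$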

Theorem 6.1 shows that for each $f \in \mathcal{D}(\mathcal{R}^d)$, $T_H(S_H(f)) = f$. This theorem
can also be used to estimate variation with respect to half-spaces by the $\mathcal{L}_1$-norm of the
weighting function $w_f = v_f$. It is shown in [11] that for any $f$ to which the above
representation applies,

$$\|f\|_{H_d} \leq \int_{S^{d-1} \times \mathcal{R}} |w_f(e,b)|\, de\, db.$$

Combining this upper bound on $H_d$-variation with Corollary 5.2, we get a smoothness condition
that defines sets of functions that can be approximated by $\mathrm{span}_n H_d$ with rates of the
order of $1/\sqrt{n}$.